A simple sketching algorithm for entropy estimation over streaming data
Abstract
We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Rényi entropy, which depends on a constant α. This quantity can be estimated efficiently and unbiasedly from a low-dimensional synopsis called an α-stable data sketch via the method of compressed counting. An approximation to the Shannon entropy can be obtained from the Rényi entropy by taking α sufficiently close to 1. However, practical guidelines for parameter calibration with respect to α are lacking. We avoid this problem by showing that the random variables used in estimating the Rényi entropy can be transformed to have a proper distributional limit as α approaches 1: the maximally skewed, strictly stable distribution with α = 1 defined on the entire real line. We propose a family of asymptotically unbiased log-mean estimators of the Shannon entropy, indexed by a constant ζ > 0, that can be computed in a single-pass algorithm to provide an additive approximation. We recommend the log-mean estimator with ζ = 1, which has exponentially decreasing tail bounds on the error probability, asymptotic relative efficiency of 0.932, and near-optimal computational complexity.
Appearing in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.
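The abstract's statement that the Shannon entropy can be approximated by the Rényi entropy with α close to 1 can be checked numerically on a small example. The sketch below is only an illustration computed exactly from a known frequency vector; it is not the paper's sketching algorithm or its log-mean estimator. The function names and the toy counts are assumptions introduced here, and entropies are measured in nats.

```python
import math


def shannon_entropy(counts):
    """Empirical Shannon entropy (in nats) of a frequency vector."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def renyi_entropy(counts, alpha):
    """Empirical Rényi entropy of order alpha (alpha != 1), in nats."""
    if alpha == 1:
        raise ValueError("alpha must differ from 1; use shannon_entropy for the limit")
    total = sum(counts)
    power_sum = sum((c / total) ** alpha for c in counts if c > 0)
    return math.log(power_sum) / (1.0 - alpha)


# Hypothetical item frequencies, chosen only for illustration.
counts = [50, 30, 15, 5]
h = shannon_entropy(counts)

# Absolute gap between Rényi and Shannon entropy as alpha moves toward 1.
deltas = (1e-1, 1e-2, 1e-3)
gaps = [abs(renyi_entropy(counts, 1 + d) - h) for d in deltas]

for d, gap in zip(deltas, gaps):
    print(f"alpha = {1 + d}: |H_alpha - H| = {gap:.6f}")
```

The gap shrinks as α approaches 1, which is the property the abstract relies on. How close α must be in a streaming setting, where the Rényi entropy is itself only estimated from a sketch, is the calibration question the paper sets out to avoid.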
Similar resources
Sketching and Streaming High-Dimensional Vectors
A sketch of a dataset is a small-space data structure supporting some prespecified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-call...
Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features
Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...
Streaming Anomaly Detection Using Randomized Matrix Sketching
Data is continuously being generated from sources such as machines, network traffic, application logs, etc. Timely and accurate detection of anomalies in massive data streams has important applications in preventing machine failures, intrusion detection, and dynamic load balancing. In this paper, we introduce a new anomaly detection algorithm, which can detect anomalies in a streaming fashion ...
Using an Evaluator Fixed Structure Learning Automata in Sampling of Social Networks
Social networks are streaming and diverse, and they include a wide range of edges; they continuously evolve over time and are formed by the activities among users (such as tweets, emails, etc.), where each activity among users adds an edge to the network graph. Despite their popularity, the dynamic nature and large size of most social networks make it difficult or impossible to study the entire networ...
A simple sketching algorithm for entropy estimation
We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream when space limitations make exact computation infeasible. It is known that α-dependent quantities such as the Rényi and Tsallis entropies can be estimated efficiently and unbiasedly from low-dimensional α-stable data sketches. An approximation to the Shannon entropy can be obtained from either ...